Generalized Poland-Scheraga model for DNA hybridization 
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The Poland-Scheraga (PS) model for the helix-coil transition of DNA considers the statistical 
mechanics of the binding (or hybridization) of two complementary strands of DNA of equal length, 
with the restriction that only bases with the same index along the strands are allowed to bind. In 
this paper, we extend this model by relaxing these constraints: We propose a generalization of the 
PS model which allows for the binding of two strands of unequal lengths A^i and N2 with unrelated 
sequences. We study in particular (i) the effect of mismatches on the hybridization of complementary 
strands (ii) the hybridization of non complementary strands (as resulting from point mutations) of 
unequal lengths Ni and A2. The use of a Fixman-Freire scheme scales down the computational 
complexity of our algorithm from 0(Ai A^|) to 0(A''i Af2).The simulation of complementary strands 

Oof a few kbps yields results almost identical to the PS model. For short strands of equal or unequal 
lengths, the binding displays a strong sensitivity to mutations. This model may be relevant to the 
I experimental protocol in DNA microarrays, and more generally to the molecular recognition of DNA 

fragments. It also provides a physical implementation of sequence alignments. 

PACS numbers: 87.14.Gg; 87.15. Cc; 82.39.Pj 
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' Natural DNA exists as a double helix bound state [IJ. Upon heating, the two complementary strands may separate. 

I This unbinding transition is called DNA denaturation (see 0). The reverse process of binding is called renaturation, 

recombination, or, in more biological words, recognition between strands. 
' An important puzzle is how the extreme selectivity required by the biological machinery can be achieved in spite 
^ \ of the very high entropy of non selective binding. For instance, in a DNA microarray, single strands of DNA are 
. grafted on a surface. When this array is immersed in a solution containing complementary and mutated strands, the 
' recognition process occurs with a high accuracy, with a seemingly low rate of errors 13, 0| . 

The Poland-Scheraga (PS) model |2i IS 0> B aims at describing DNA denaturation in a simplified way: In a 
nutshell, the double helix is described as a succession of bound fragments separated by unbound bubbles (loops). 
, More specifically, the PS model assumes that the two DNA strands are exactly complementary, and that only bases 
■ with the same index can form pairs (implying that the strands have equal lengths N). As will be shown below, its 
Q ' computational time scales like N"^ . Use of the Fixman-Freire scheme reduces the scaling to 0{N), allowing for 

the study of melting of long sequences (up to a few Mbps) 10]. For a general review on DNA denaturation, we refer 
_ the reader to llj. It has also been argued that in some cases, this model can be used to detect coding regions in 
linear (non-circular) DNA T^. 

ILj^ The aim of this paper is to generalize the PS model by relaxing its main constraints, namely (i) allow the 

strands to be of unequal length A^i and N2 and (ii) allow the strands to be non complementary. As a result, any 
base of strand 1 can pair with any base of strand 2 (crossings of base pairs being excluded) . A preliminary account 
of this work can be found in ref.[l3. In this generalized model, the computational time scales like N1N2 but again, 
a Fixman-Freire scheme brings it down to 0{NiN2)- A very similar complexity reduction was obtained in the case 
of circular DNA For practical purposes, this limits the present implementation of our algorithm to sequences of 
up to a few kbps. 

This paper is organized as follows: in section ^ we review the standard PS model, using partition functions ^ 
rather than conditional probabilities Q ■ In section IIIII we generalize the recursion equations for the appropriate 
partition functions in order to include the possibility of mismatches, unequal lengths and non complementary strands. 
In section Hvl the algorithm is first applied to complementary strands of eukaryotic DNA of a few kbps and compared 
with the standard PS algorithm: We check that there are essentially no mismatches, thus validating the assumptions 
of the PS model for complementary DNA strands of equal length. We have then tested the algorithm with short 
sequences, in order to model the hybridization of sequences of unequal lengths as it occurs in DNA microarrays 
Our simulations show that molecular recognition is both very selective for complementary fragments of the strands 
and very sensitive to single point mutations in the sequences. Our algorithm also describes the physical process of 
(complementary) sequence alignment. 
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II. THE POLAND-SCHERAGA MODEL 
A. Recursion relations 

Although the original Poland-Scheraga model was developped for homopolymeric strands 0, 0, 0| , we will focus 
on realistic (heteropolymcric) DNA sequences. Exact recursion relations have been derived by Poland 0, using 
conditional and thermodynamic probabilities. Here, we follow an equivalent approach using partition functions, 
which turns out to be easier to generalize. 

We first consider two complementary strands of equal length N, and we denote by Zf{a) the forward partition 
function of the two strands, starting at base (1) and ending at base (a), with bases (a) being paired. We model the 
interactions of base pairs by stacking energies {ea^a+i;p,i3+i), which are known to describe nucleotides interactions in 
a more accurate fashion than simple base pairing. These stacking energies account in particular for screened Coulomb 
interactions and for hydrogen bonds between Crick- Watson pairs, and depend on pairs of adjacent bases on the two 
strands, the pair (a, a + 1) belonging to strand 1 and the complementary pair (3+1) belonging to strand 2 (with 
P = a in the PS model). Since the strands are complementary, only 16 stacking energies out of 4** = 256 possible 
terms turn out to be non zero. In Appendix IdI we give the values of the 10 different stacking energies used in the 
program MELTSIM ^3 ■ These energies (which depend on the salt concentration) will be used throughout this paper. 

1 a+1 1 



1 a+1 1 

FIG. 1: Recursion relation for Zf{a + 1) (eq.Q) in the PS model. 

To find the recursion relation obeyed by Zf{a + 1), we notice that there are three ways to bind a pair of chains of 
length a + 1 : Either the last pair (a, a + 1) is stacked, or there is a loop starting at any (a') {1 < a' < a — 1) and 
ending at {a + 1), or there is no loop (Figure^. 

The forward partition function therefore satisfies 

a-l 

Zf{a + 1) = e-^'" Zf{a) + as J2 ^/("')A/'(2(a + 1 - a'j) + as Mia) (1) 

c«'=l 

where P — l/ksT is the inverse temperature, Ea = £a.a+i-a,a+i is the stacking energy of base pairs (a, a + 1), as is 
the bare loop formation (cooperativity) parameter and as is the bare free end formation parameter (we assume that 
these parameters are base independent). The factor J\f{2{a + 1 — a')) counts the number of conformations of a chain 
starting at base {a') and ending at base {a + 1) and is asymptotically given by 

AA(2(a + l-a')) = Ai"""7(a-a') (2) 

where ks log/z is the entropy per base pair and f{x) = ^ is the probability of return to the origin of a loop of length 
2x. We assume that the entropy factor /x does not depend on the chemical nature of the base pair. The exponent c 
depends on the interaction of the loop with the rest of the chain: It has been extensively discussed in the context of 
homopolymeric DNA 001^3' equal to 3/2 for non-interacting Gaussian loops, to ~ 1.8 for non-interacting 

self-avoiding loops, and to « 2.15 for interacting self-avoiding loops. As stated above, eq.® is vahd only for large 
enough loops. For shorter loops, one can use different formulae such as eq. (20) of ref.flj. A more accurate way to 
account for short loop entropies would be to have a look-up table as is currently done in RNA folding 0, ^| . To 
the best of our knowledge, this has not yet been implemented for DNA. 

The last term on the r.h.s. of eq. 1^ represents the contribution of unbound extremities: The factor A4{a) counts 
the number of conformations of a pair of unbound chains starting at base (1) and paired at base (a + 1) and is 
asymptotically given by 




M{a) = fi^gia) 



(3) 
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where g{x) = For non-interacting Gaussian chains c = 0, and numerical evidence points to c 0.09 for self- 
avoiding chains |20l |. 

In a similar way, we denote by Zb{a) the backward partition function of the two strands, starting at base {N) and 
ending at base (a), with base (a) being paired. To find the recursion relation obeyed by Zh{a), we again notice that 
there are three ways to bind a pair of chains at base (a), starting from base (N). The backward partition function 
therefore satisfies 

N 

Zb{a) = e-'^''' Zbia + l)+as Zb{a'W{2{a - aj) +as MiN - a) (4) 

The probability p{a) that base pair (a) is bound can then be expressed as 



Zf{a)Zb{a) 



(5) 



where Z is the thermodynamic partition function of the two strands. Restricting ourselves to configurations with at 
least one bound base pair (i.e. we do not consider dissociation), we may express Z as (Figure EJ 



N- 1 
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FIG. 2: Graphical representation of the thermodynamic partition function Z (eq.Q) in the PS model. 



Z = Zf{N)+os{^i9{l)Zf{N-\) + ^i''g(2)Zf{N-2) + ■■■ 

+ M^-2.g(7V - 2)Zf{2) + M^-\g(iV - (6) 

or equivalently as 



Z = Zb{l)+as{lJig{l)Zb{2)^ ii^g{2)Zb{2,) + --- 

+ ^^''-^g{N - 2)ZbiN - 1) + fi^'-'giN - l)Zb{N)) (7) 

For our purposes, we now define 

Z}{a)^fi-^Zf{a) (8) 

and 

Z*ia)^ti-^^-''+'^Zb{a) (9) 

so that the the recursion relations read 



a-l 

Z}{a + 1) = e-''^=-i°s'' Z}{a) + do ^ Z}{a')f{a - a') + aig{a) (10) 

a' = l 

and 



JV 

Z^(a) = e-^^=-i°s^ Z^(a + l) + ao ^ Zt{a')f{a'-a^l) + a^g{N^a) (11) 

a'=a+2 
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where ctq = — and cti = — . These equations, deahng with partition functions, are equivalent to Poland's probabilistic 



approach as sketched in Appendix 151 
Equation Q now reads 



^Z}{N) + a, Z}{a)g{N - a) 

The fraction 9ps of bound basepairs is then given by 

N 

N 



= .^t:;-iL; . — r (12) 



1 ^ 



This quantity can be measured by UV absorption at 268 nm [H. The derivative -dOps/dT with respect to tem- 
perature displays sharp peaks at the temperatures where various fragments of the sequence open. For a homopolymcr, 
the fraction 9ps is proportional to the internal energy of the chain, and thus —dOps/dT is proportional to the specific 
heat. For non homogeneous sequences, it can be easily checked that the peaks of the specific heat also coincide with 
those oi-dOps/dT. 

Since all partition functions are calculated with at least one bound pair, one has to include the possibility of 
strand dissociation , i.e. of two unbound strands (Appendix^l. As seen in experiments llj, strand dissociation is 
particul arly important for small N fragments : The corresponding calculation of the fraction of dissociated and bound 
strands [ij is given in Appendix IXI Denoting by Oh the fraction of bound strands, the total fraction of bound pairs 
(the quantity which is measured experimentally) is given by = ObOps- We will consider here only 9ps- 

B. Practical implementation 

1. Approximation for g{x) 

We have mentionned above that g{x) = x~'^. We are not aware of any simulation with c ^ 0. In order to compare 
our results with previous approaches, we therefore set c = (non-interacting Gaussian value), so that g{x) = 1 in eq. 
(trniTTI and US). 

The recursion relations for the PS model become 



Z}{a + 1) = e-^^"-i°g^' Z}{a) + fio J] Z}{a')f{a - a') + (14) 

a' = l 

and 

N 

Ztia) = e-^'--'°^^ Ztia + 1) + a„ ^ Z*,{a')f{a' ~a-l) + ai (15) 

where f{x) = ^. 

Accordingly, the probability p{a) that base pair (a) is bound is calculated as 



Z}{a)Zl{a) 

P(") = -; N—^ (16) 

iz;(iv) + aiEL7^/(«) 



with the fraction of bound pairs Opg = X]q=i p(q^)- 



2. The Fixman-Freire scheme 



From a practical perspective, solving numerically equations (|14I15|I requires a CPU time of order iV^, since one has 
to calculate 0{a) terms for each value of a. The Fixman-Freire (FF) method reduces this CPU time by approximating 
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the loop factor f{x) of equations H14I15|) by 

/(a;) = l^^a, e-^^- (17) 

i—1 

In equation H17() the number / of couples (ai^i) depends on the desired accuracy. The parameters {ai,bi) are 
determined by a set of non-linear equations (see [g])- 

For a sequence of length N = 2000, the choice 1 — 9 gives an accuracy better than 0.5% and we have adopted 
this value throughout this paper. Larger values (/ = 14) are used in for lengths of order 150 kbps and in the 
program MELTSIM which implements Poland's recursion relations ^ with a FF scheme. The CPU time of the 
FF scheme scales down the computational cost from 0{N'^) to 0{N x /), as shown by equations (|B8I) and l|B13|) of 
Appendix ^ 

3. Values of the parameters 

In equations (|14I15|I . one needs the values of the entropy factor /i, the stacking energies (eq, — Sa,a+i;a,a+i), the 
exponent c of the loop factor IjlTI) and the loop formation (cooperativity) and free end formation effective parameters 
ctq and ai. 

For complementary strands, the stacking energies we have used are the ones of MELTSIM [l3|; we have also 
adopted the value log/x = 12.5047 of this program (see Appendix ID|) . Point mutations, when present, are assigned a 
zero stacking energy : In all our numerical calculations, we have indeed checked that the results do not depend on 
the precise value of the stacking energy of the mutated pair, as long as it is larger than half of the typical unmutatcd 
stacking energies, i.e. « — 2500°if . 

Our calculations have been done with the Flory value c = 1.8, and the MELTSIM value of the cooperativity 
parameter cto = 1.26 10^^ foi' the free end parameter cti, we have followed reference ^^'^ taken 

f?i = ~ 3.5 10^'^. This set of parameters will be hereafter referred to as standard. 

The exponent c and cooperativity parameter (Tq have given rise to some discussions [2ll l22j. However, we did not 
find in the literature any discussion on the role of ai . This is why we have tested other values of the parameters such 
as c = 2.15, ao = 1.26 10"'' and cti = 1. For the cases studied in this paper, the changes are rather small. We find 
for instance that, as long as cti is non-zero (in fact > 10^^), its value is quite irrelevant (cti = corresponds to the 
case of paired extremities). 

The boundary conditions for the recursion equations H14() and H15() . as well as their practical implementation are 
exposed in Appendix IBl 



III. GENERALIZING THE PS MODEL 
A. Equations 

We now generalize the PS model in different ways: We allow for unequal strand lengths denoted by A''i and A''2, 
and non complementarity of the sequences. This in turn implies that one must allow for pairing of any base (a) of 
strand 1 with any other base (/3) of strand 2 (while forbidding the crossing of base pairs) |23|. Further, we allow loops 
(with a factor 175), only if there is at least one unpaired base on each strand. Finally, we associate a factor of unity, 
instead oias for the pairing of extremities (bases {Ni) with (/3) or (a) with {N2)) (see Figure 0J|. 

At this stage, it should be noted that since the generalized Poland-Scheraga model (GPS) includes all configurations 
from the original PS model, its free energy Fcps{T) is necessarily lower than that of the PS model Fps{T) 

Fgps{T) < Fps{T) (18) 

We denote by Zj:{a, (3) the forward partition function of the two strands, starting at base (1) and ending respectively 
at base (a) (strand 1) and at base {(3) (strand 2), bases (a) and (/3) being paired. We further denote by Zf,(a,/3) the 
backward partition function of the two strands, where strand 1 (resp. strand 2) starts at base (iVi) (resp. (-/V2)) and 
ends at base (a) (resp. (/?)), bases {a) and (/?) being paired. Keeping the same notations as in the PS model, and 
setting Ea-fj = Ea.a+i-ffjj+i, thcsc partition functions satisfy the recursion relations (Figure|3I) 
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FIG. 3: Recursion relation for Zf{a + l,/3 + 1) (eq.^T^l) in the GPS model. 



Zf{a + 1,(3+1) = e~^'-'PZf{a,(3) 

a-l 0-1 

+ as ^ ^ Zf{a\l3')U{a + 1 - a' + ^ + 1 - 

a' = l /3' = 1 

+ osM{a,l3) (19) 



and 



Zb{a,f3) = e-'3^-''Zfc(a + l,/3+l) 

+ <JS £ Zb{a',(3')U{a' -a + P' -0) 

a'=a+2 /3'=/3+2 

+ asMiNi-a,N2- 13) (20) 

where Af{x) = fi^~^f{^ — 1) and M.{x,y) = fi~r^ g{^^Y^) . Since we have used an entropy factor of fc^ log/i per base 
pair, we have assigned a factor ^ log /i per free base. 

The probability p(a, /?) that base (a) of strand 1 is paired with base (/3) of strand 2 is then expressed as 



p(a,/?)^ ^/("'^)f^("'^) (21) 

where Z is the thermodynamic partition function of the two strands. Equations H19() and (|20|l show that the compu- 
tational complexity of the generalized model is 0(iVf iV|). 

As in the PS model, we take from now on c = (i.e. g{x) = 1). Restricting ourselves to configurations with at 
least one bound base pair, we may then express Z as (Figure^ 




FIG. 4: Graphical representation of the thermodynamic partition function Z (eq.((22J) in the GPS model. 



Ni-l JV2-I 

Z = Zf{Ni,N2)+ fi^Zf{a,N2)+ ^ ^l—Zf{Nl,(3) 

a=l p=l 

+ E E M - Zf{a,(3) (22) 

ct=l 13=1 

Note that the thermodynamic partition function Z can also be expressed in terms of Zi,{a,(3). 
In complete analogy with the PS model, we define 
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Z}ia,(3) = ^i~^Zfia,P) (23) 



and obtain 



+ -0 E E ^;(a', /?') /( "'"^t^'^' ) + J?i (24) 



q' = 1/3' = 1 

We similarly define Z^{a,f3) = ^ 2 '-Zi,{a,P) and get 



Z;(«,/3)Z*(a,/3) 



where 

Ni-l JV2-I Afi-lAr2-l 



^ Ni-l N2-I N1-IN2-I 

Z* = -Z}{Ni,N2)+ E Z}{a,l)+ J2 Z}{l,P)+a, ^ E (26) 

Q = l /3=1 Q = l /3 = 1 

Since we have allowed for the pairing of any (a) with any (/5), we have to define the equivalent 9qps of the PS 
order parameter Ops- For this purpose, we define for each base (a) of strand 1, the base (/3o(a)) of strand 2 which is 
maximally bound to (a), that is 

p(a,/3o(a))= max {p{a, (3)) ^ Pmax{a) (27) 
The fraction of maximally bound pairs then reads 



1 

Sgps = ]^ E P™"^ (28) 



a=l 



where N — min(iVi, N2)- 

The number of mismatched pairs is defined as 



Ni N2 

iVA/ = E E ^) ~ ^GPS (29) 
a=l 13=1 



where A^gps = YflLiPmax{a) 



As in the PS model, one has to include the possibility of strand dissociation (Appendix^. Denoting by Ob the 
fraction of bound strands, the total fraction of (maximally) bound pairs is given by = ObOcPS- We will consider 
here only 0gps- 

B. Practical implementation 

We use the same parameters and the same FF method as in the PS model. This amounts in particular to set 
fix) = ^ ~ '^i e~^^^ in equations (|24|l . We mostly focus on the boundary conditions and algorithmic aspects 

of the resulting recursion relations. 
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1. Boundary conditions 

We write the list of boundary conditions relevant to the recursion relations of equations (|24|l . The corresponding 
strands configurations can be easily deduced from that list. We thus have, for a — 1,2, ...Ni and (3—1,2, ...N2 

Z}{1,P) = 

Z}{a,l) = 
ZUNi,P) = 
Zt{a,N2) = /x-i 

When stacking comes into play, for a = 2, 3, ...A^i and (3 — 2,3, ...N2 we have 

Z}{a,2) = e-^''''-'''-^°^^'Z}{a- 1,1) + 

and 

Z*(iVi-l,/3) = e-^-'"i-i^'^"'°s^Z*(A^i,/3+ 1) +5i 
Z*{a,N2-l) = e-'''-'^2-i-^°sf^Z*,{a + l,N2)+ai 



for a = Ni-l,Ni- 2,. ...,2,1 and (3 ^ N2-1,N2- 2,. ...,2,1. 



2. Algorithmics 

We expect the forward and backward partition functions to grow exponentially with the number of basepairs. 
As in the PS case (see Appendix 0, we introduce recursion relations for the logarithms (free-energy like) of these 
functions, to avoid underflows or overflows in the computations. Generalizing the PS approach of Appendix ^ we 
define Qi(oi,(3) and iii{a,(3) by 

0.(a,/3) = E E Z;(a',/3')e''-^ = e^-^e''-^-^) (30) 

a' = l /3' = 1 

We obtain 

Z}{a, (3) = e-'''^ (g,(a, /?) + Q,(a - 1, /3 - 1) - Q,{a - 1, /?) - Q,{a, P - 1)) (31) 

or equivalently 

Z}{a, (3) = e''-^"'^) + e-'>^e'''^'^-i.p~i) _ ^-ii ^ ^^,(a-i,/3)^ ^32) 

Using equation (|24fl with (/(x) — ~ Si=i e"''*^), one finally obtain a recursion relation for the linearly growing 

fJ.i{a,P) 's 



fi,{a + 1,(3+1)^ fi,{a, (3) + log{E + F + G + H) (33) 



where 



E = 


I. ''i 




F = 


g-/5£c,;/3-l0g/J 1 


;i + e-M.K/3)(e-fc.eM.(a-i,/3-i) _e 


G = 


/ 


gMfc(a-l,/3-l)-Mi(a,/3) 




k=l 




H = 
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The boundary conditions pertaining to equation l|33l) can be obtained from section B and are given in 
Appendix 10 

One similarly defines 

i?.(a,/?) = E E Zl{a\P')e-'^'^ = ^-..i^^ (34) 

a'=a f3'=l3 

which yields 

Z^{a, P) = e^'^ /?) + i?,(a + f , + 1) - i?,(a + f , /3) - /? + 1)) (35) 

or equivalently 

Z*(a, P) = e'^^^^^P^ + e-^'e'''("+i'^+i) - (e''-("'^+i' + e^'^^+i'^)) (36) 
and using equation with (/(x) — ^ ~ Ei=i e"''*^), one gets 

v,{a,(3) =^v^{a + l,(3+l)+\og{E' + F' + G' ^H') (37) 

where 

E' = _^g--^g-i'.(«+l,/3+l)(-gi',(a+l,/3) ^gi..(a,/3+l)^ 

^' ^ g-i^i(Q+l,/5+l) |-g-6igi.,(Q+2,/3+2) _ g--^ |-gZ..(a+2,/3 + l) _^ gi/. (a + 1,/3+2) j j 

/c=l 

where r = (,-Pe^,f,-\ogt, 

The boundary conditions pertaining to equation l|37j) can be obtained from section B \\ . and are given in 
Appendix [C| 

Equations (ESJ and JSTI) show that the FF scheme brings down the CPU cost from 0{NfN^) to 0{NiN2 x /). 

The knowledge of the ^i{a,(}) and i^i{a,0), or of the Zj{a,(3) and Z^(q;,/3) (through equations (|32l36|l ). enables 
us to calculate various thermodynamical properties of interest, including the probability p{a, (3) of pairing of bases 
(a) and (/3) (see eq. (j^HDi or the fraction of (maximally) bound pairs Oops (see eq. (|^). 

IV. RESULTS 

We have studied the recursion relations of the GPS model and compared them, whenever possible, with their 
standard PS counterpart. In this paper, we will present a few examples, leaving a systematic study for a future work. 

A. Medium length sequences 

In this section, we show a comparison of the two algorithms in the case of medium length sequences (A^i — N2 — 
N — 1980). This sequence was extracted from chromosome four of Drosophila melanogaster Using the standard 
set of parameters of section Hi B 31 we have first studied the binding transition of the two complementary strands: We 
show in FigureElthe PS and GPS results for , where 9ps = jfJ2a=iP(.'^) ^^'^ ^gps = jfY.a=iPmax{a)- 

As is clear, the two curves coincide over a wide range of temperature, except close to the final unbinding peak 
((T„ 87.2°C), where fluctuations allowed by the GPS model enhance the peak. 

To illustrate this point, we plot the total number of bound pairs N9ps and NOgps around the unbinding temper- 
ature Tu ^ 87.2°C (FigureEJ: For instance at 88°C, the PS model overestimates the total number of bound pairs by 
about 60. 

In addition, a study of the number Nm of mismatches as a function of temperature indeed shows that Nm ^ 
0, T < T^. Since the partition function Z includes configurations with at least one bound base pair, one finds 
Nm ~ 1, T > T„. 
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FIG. 5: The specific iieat — ^ for complementary sequences of lengtii Ni — N2 = 1980 for (i) the PS model (*) and (ii) the 
GPS model (full line). A slight difference is observed for the final peak (r„ ~ 87.2°C). 





FIG. 7: Blow-up of the two central peaks of Figure|^for the GPS model (i) standard parameters (*) (ii) parameters of reference 
[2^ (thick line). 
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FIG. 8: Blow-up of the two central peaks for the GPS model (i) complementary strands (thick line) (ii) a mutation in the 
middle of strand 1 (o). 



We have studied several sequences with the same results, thereby validating the hypothesis that DNA denaturation 
can be modeled by the PS model (a = /3), implying a very strong selectivity of molecular recognition. 

We have further considered the influence of the (c, (To,(^i = \/^) parameters in the framework of the GPS model. 
Following a recent proposal p^ . we have compared the standard set of parameters with the set (c = 2.15, (Tq = 



1.26 10"*,CTi 

The results are shown in Figure[71 As in the standard PS model, the dependence of the results on these parameters 
is very weak, in agreement with reference |22j . 

Finally, we have considered the influence of a point mutation on strand 1 within the GPS model, and using standard 
parameters. The corresponding stacking energies were set to zero. In Figure |H| we show the effect of single point 
mutation in the middle region of strand 1: it slightly shifts the whole curve towards lower temperatures. The effect 
of multiple mutations and the effect of their location on the strands will be studied in a future work. 



B. Hybridization of short fragments and the influence of mutations 

As mentioned in the introduction, the hybridization of short DNA fragments is of interest for DNA microarrays. 
We have compared the PS and GPS models for complementary fragments of identical lengths Ni = N2 = 30. In that 
case, there is only one peak in — around this peak, the situation is very similar to that observed around the last 
peak Tu of the previous section. The PS model overestimates the number of bound pairs by about 5 basepairs. 

Using the GPS model, we have also studied the hybridization of strand 1 (A^i = 30) taken from the same drosophile 
chromosome 24] , with a fragment of length N2 = 70 containing the complementary of strand 1 in its middle section 
(Figure infa). The first and the last 20 bases of strand 2 are taken from a different region of the same DNA fragment. 
We have then studied the same system, with a point mutation (see Figure |2Ib)), in the middle of strand 1 (X) or 
close to its extremities (O). 





(a) 



(b) 



FIG. 9: Recognition of two strands of different lengths (A'^i = 30, = 70) (a) Fragment 1 of strand 2 is complementary to 
strand 1 (b) One creates a mutation on strand 1, either in the middle (X, a — 15) or at the end (O, a — 5). 
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FIG. 10: The effect of mutations X (thin hne) and O (dashed line) on the no mutation situation (thick line). 

In Figure [TUl we plot — ^ for the three cases mentioned above. For short sequences, the effect of mutation (X) 
in the middle section of the strand is important: the curve has two maxima instead of one, corresponding to the 
opening of the two subfragments of the strand. The effect of mutation (O) is much weaker since it is located near 
the extremity of strand 1. This general feature has been checked on many different choices of strand 1. The physical 
origin of this phenomenon is easy to understand: since ai ^ 0, fluctuations are larger near the ends of the strands, 
and since the extremities of strand 1 are nearly molten, the effect of a mutation in this region is very weak. In Figures 
(|11I12I13|I . we plot the opening probability (l — Pmax{c()) along strand 1, for various temperatures and for the cases 
of no mutation, mutation (X), and mutation (O). 

1.0 

1-Pja) 
0.8 

0.6 

0.4 

0.2 



'■'0.0 10.0 20.0 30.0 

a 

FIG. 11: Opening probability (1 — pmax{a)) along strand 1 for temperatures t=60, 74.2, 82.2, 85.6, 90.2, for the no mutation 
case. 

As stated in section the study of short fragments and loops, would ideally require the use of look-up tables for 
the entropies, rather than the asymptotic formula In addition, we have checked that our results do not depend on 
the precise value of the stacking energy of the mutated pair, as long as it is larger than half of the typical unmutated 
stacking energies, i.e. « —2500°K. 
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FIG. 12: Opening probability (1 — pmax{a)) along strand 1 for temperatures t=60, 74.2, 82.2, 85.6, 90.2, for mutation (X) at 
a = 15. 
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a 

FIG. 13: Opening probability (1 — pmax{a)) along strand 1 for temperatures t=60, 74.2, 82.2, 85.6, 90.2, for mutation (O) at 
a — 5. 

V. CONCLUSION 

We have presented a generalization of the Poland-Scheraga model, which allows for the pairing of any bases on two 
DNA strands of unequal lengths. The resulting algorithm has been implemented within a Fixman-Freire scheme. The 
examples that we have given emphasize the fact that, for realistic sequences, the Poland-Scheraga model captures the 
essence of the DNA strands binding selectivity. In this respect, the generalized model is particularly useful in certain 
specific situations (such as tandem repeats 25]). To further illustrate the strong selectivity of real sequences, we 
compare the number Nm of mismatches, as given by equation (|29|l . for homopolymeric and realistic complementary 
strands of length N = 1980 (using the standard set of parameters) in Figure PPil 

Our approach is also useful for strands of unequal lengths, in particular for DNA niicroarrays. The role of mutations 
remains to be systematically studied. 
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FIG. 14: Number of mismatched pairs Nm (eq. (HHJ) ^ ^ function of temperature for homopolymeric (aaaa...) and (tttt...) 
strands of length A'^ = 1980. For comparison, the heteropolymeric case studied in section llV Al is shown as the heavy dot-dashed 
hne, lying close to the temperature axis. 
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APPENDIX A: DISSOCIATION EQUILIBRIUM OF DOUBLE STRANDED DNA 

In this appendix, we show how the thermodynamic fraction of dissociated DNA chains can be computed. Consider 
a solution of M single strands of DNA (denoted Si) of size TVi and M single strands of DNA (denoted S2) of size N2 
in a volume V . The total concentration of single strands in the solution is 

2M 

These strands can associate (or hybridize) into double stranded DNA (denoted D) according to the chemical reaction 

D ^ Si + S2 

We denote by zi the partition function of a single strand Si and by Z2 that of a single strand S2 both with fixed 
origin. In a purely entropic model for the single strands, these partition function would read 



^1,2 



but of course, one can use more realistic models. 

Let z denote the partition function of two hybridized strands D (with fixed origin), i.e. strands with at least one 
pair of bound bases. This last partition function z is nothing but the total partition function Z calculated in section 

El 

A generic configuration of the solution consists of p hybridized strands D, M — p single strands Si and M — p single 
strands S2. The partition function of the system can thus be written as 



^ ^ (Z1Z2)^^ 2M-P 



In the thermodynamic limit {M — > 00) the sum in eq. IjAip is dominated by the value of p which maximizes the 
generic term. Using the Stirling formula, the equation for p reads 

{M-pf _ Z1Z2 , , 

pv - ~ ^^^> 

Introducing the dissociation constant K of the Law of Mass Action 

^ [Si].[S2] 



we have 



Denoting by 9b the fraction of bound strands 



[D] 

{M~pf 
pV 



21 Z2 

K = — 

z 



(A3) 



equation ljA2p is equivalent to the quadratic equation 



^2 - 2 ( 1 + 0b + 1 = (A4) 
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which imphes 



e„.l,IL-J(!^)\2^ (A5) 

Ct ^ \CtJ Ct 

Note that there is a factor 4 in our eq. I|A4|I compared to eq. (34) of ref. 1^ due to the fact that we consider the 
two strands as distinct objects. 

APPENDIX B: PRACTICAL IMPLEMENTATION OF THE PS MODEL 
1. Boundary conditions for the PS model 

Equation 1)14(1 apphes for a > 2, and equation ((15|l for a < N — 2. We must initiate the recursion relations with 
boundary conditions on Z*j{m) for m = 1, 2, and on Z^{m) for m ~ N, N — 1. A similar situation exists in Poland's 
probabilistic approach 

Since cti 7^ 0, these boundary conditions are important. We have 

Zfil) = 1 

Zf{2) = e-^'^Zf{l)+^sf^ (Bl) 

which gives 

Z)il) = 

Z}{2) = e-^^'-^°s''Z}{l)+ai (B2) 



Similarly, we have 



yielding 



Zb{N) =. 1 

Zb{N~l) = e-f''"-'Zt,{N)+asfi (B3) 



2. Algorithmics 

Inserting equation H17() in 114(1 . we have 



Z}ia + 1) = e-'^^"-'°sA< z}{a) + ao ^ a, ^ e-''("-"')z;(a') + a, (B5) 



2—1 a' — 1 



As expected, and made explicit in equation (|B5|1 . the growth of Zj{a) is exponential with a. To deal with (free- 
energy like) quantities that are linear in a, we define Qi{a) and by 

a 

Q^{a) = ^ Z;(a')e^'"' = e'''"e^-(") (B6) 

a' = l 

and get 

Z*{a) = e-^'"(Q,(a) - Q,{a - 1)) = eM°''^ - e-^'ef'^"-'^^ (B7) 
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Using equation ljB5|) , one finally gets a recursion relation on the (linearly growing) fii (a) 's 

Hi{a + l)=fi,{a)+log{A + B + C + D) (B8) 

where 

A = e-"^ 

— g-/3£c.-logM^ _ g-''igA'i(a-l)-A'i(a)^ 
/ 

fc=l 

£> = CTie-^'(") (B9) 
Equation (|B8|I is to be iterated with boundary conditions 

= iogz;(i) = -\og^l 

11,(2) = log(e-'''Z;(l) + Z;(2)) (BIO) 
One similarly defines Ri{a) and i^iia) through 

a. 

R,{a) = J2 ^b*("')e"^'"' = e-'''"e"'(") (Bll) 



which imply 



and 



Z*{a) = e^'"(i?,(a) - i?,(a + 1)) = e"^'*") - e-^'e^'-'^+i' (B12) 



where 



i^^(a) = z^.(a + 1) + log(^' + B' + C' + D') (B13) 
A' = e-''' 

/c=l 

D' ^ CTie-^'f^+i) (B14) 
with corresponding boundary conditions 

y,{N) = \ogZtiN)^-\og^i 
MN-1) = \og{e-''^Z^{N) + ZUN-l)) (B15) 

The knowledge of the /Xi(a)'s and of the i'i{a)'s, that is of the Z^{a)'s and of the Z^(a)'s, enables us to calculate 
various thermodynaniical properties of interest, including the probability p{a) of pairing of basepair (a) (see eq. H16f) '). 
or the fraction of bound pairs 9ps- 

3. Connection with Poland's approach 

Poland ifl] has derived recursion relations using conditional and thermodynamic probabilities. The link with the 
present approach is made clear, if one rewrites eq Q as 

Ztja + l) , Ztja') ,,,,, , M{N-a) 

e y / ^ + ^5 > . A/(2(a - a)) + erg — — — — = 1 (B16) 

cx'—a+2 

Equation ljB16|l is term by term identical to Poland's equation (lb) on conditional probabilities. Similarly, inserting 
equation Q in equation leads to Poland's equation (10) on thermodynamic probabilities 
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APPENDIX C: BOUNDARY CONDITIONS FOR THE GENERALIZED MODEL 

The boundary conditions for the iJ,i{a, /9)'s corresponding to equation (|33|l can easily be found from section (jlll B 1|) 
and from equation H32|) as 

1 _ e~~" 

fJ.^{a, 1) ^ -log fi + log — (CI) 

1 — e 2 



for a = 1, 2, • • • , A^i. and 



for /3= 1,2,-- - ,N2. 
We also have 



/x,(l,/3) = -log// + logi^^4- (C2) 

1 — e 2 



^,(«, 2) = - log^i + log( ~^ + e^("'-")^;(«', 2) ) (C3) 

e 2 - 1 



for a = 2, 3, • • • , A^i, and 



/..(2,/3) =-logA^ + log( ^- + ^ e^('5'-«z;(2,/3') ) (C4) 

6=^-1 /3' = 1 

for /3 = 2,3, ••• ,iV2. 

The boundary conditions for the i'i{a, /3)'s corresponding to equation (|37|l can easily be found from section (jlll B ip 
and from equation H36fl as 

1 - g-^(JVi-a+l) 

z/,(a,7V2) = -logM + log '- — (C5) 

1 — e 2 



for a = 1, 2, • • • , A^i. and 



1 - e" 3- (^2-/3+1) 

1^, (iVi , /3) = - fog ;x + fog '- 3- (C6) 

1 — e 2 



for /?= 1,2,-- - ,Af2. 
We also have 



1 - p-^iNi-a+l) ^1 ^ 

^,(a,iV2-l) = -log/i + log( ^- + J2 e-^("'-")Z*(a',iV2-l) ) (C7) 

P 2 — 1 / 



for a = 1,2, • • • ,7Vi - 1. and 



1 _ -^(N2-(}+l) -^2 

i^.im - 1,(3) = -logM + log( + e"^('3'-«Z*(7Vi - 1,/?') ) (C8) 



for/?=l,2,,--- ,7V2-1. 



6 2-1 /3'=/3 



APPENDIX D: THE MELTSIM STACKING ENERGIES 



For complementary base pairs, the stacking energies we use are the ones of the program MELTSIM. They are 
written as ea,a+i;i3,i3+i — —12.5047 T(n{a),n{a + 1)), where 71(0;) denotes the chemical nature (A,T,G,C) of base 
(a), and where T(n{a),n{a + 1)) has the dimension of a temperature. The factor 12.5047 is the entropy loss due to the 
formation of a base pair divided by Boltzmann's constant (12.5047 = xw')' B^'^^s (a) and (a + 1) belong to strand 
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1, identified as going from 5' to 3'. If s denotes the salt concentration, the effective temperatures T(n{a),n{a + 1)) 
are generically given as 

T(^n{a),n{a + 1)) = ao{n{a),n{a + 1)) logs + bo(^n{a),n{a + 1)) 

where flo and bo do not depend on s. In our simulations, we took s = 0.0745. For this particular value, we list below 
the effective temperatures T{n{a), n{a + 1)) with A = 1, T = 3, G = 2, C = 4. 



T(l,l) 


= n3,3) = 


339.68 K 


T(l,2) 


= r(4,3) - 


353.32 K 


T(l,3) 


= 341.72 K 




T(l,4) 


= T(2,3) = 


378.83 K 


T(2, 1) 


= T(3.4) = 


357.96 K 


T(2,2) 


= r(4,4) = 


372.73 K 


r(2,4) 


= 408.99 ii: 




T(3, 1) 


= 326.65 K 




T(3,2) 


= r(4,i) = 


341.30 K 


r(4,2) 


= 361.49 K 





For non complementary bases (e.g. in the case of point mutations), we take £a,a+i;/3,i3+i = 0- 



